Abstract

The primary objective of this paper is to revisit a widely held view 
that decision theory provides a unifying framework for comparing the fre- 
quentist and Bayesian approaches by bringing into focus their common 
features and neutralizing their differences using a common terminology 
like decision rules, action spaces, loss and risk functions, admissibility, etc. 
The paper calls into question this viewpoint and argues that the decision 
theoretic perspective misrepresents the frequentist viewpoint primarily 
because the notions of expected loss and admissibility are inappropriate 
for frequentist inference; they do not represent legitimate error probabil- 
ities that calibrate the reliability of inference procedures. In a nutshell, 
the decision theoretic framing is applicable to what R. A. Fisher called
"acceptance sampling", where the decisions revolve around a loss function
originating in information 'other than the data'. Frequentist inference is 
germane to scientific inference where the objective is to learn from data 
about the 'true' data generating mechanism. 



1 Introduction 



A widely held view in statistics is that Wald's (1950) decision-theoretic frame- 
work provides a broad enough perspective that can accommodate both the fre- 
quentist and Bayesian approaches to inference, despite their well-known differ- 
ences. Indeed, it is often regarded as a unifying framework for comparing these 
approaches by bringing into focus their common features and neutralizing their 
differences using a common terminology based on decision rules, action spaces, 
loss and risk functions, admissibility, etc.; see Berger (1985), Robert (2007). 

Historically, Wald (1939) proposed the original decision-theoretic framework 
as a way to unify frequentist estimation and testing: 

"The problem in this formulation is very general. It contains the prob- 
lems of testing hypotheses and of statistical estimation treated in the 
literature." citing Neyman (1937) in a footnote (p. 299) 

Among the frequentist pioneers, Jerzy Neyman enthusiastically accepted this
broader perspective in the early 1950s, primarily because it seemed to provide 
a formalization for his behavioristic interpretation of Neyman-Pearson (N-P) 
testing based on the accept/reject rules; see Neyman (1952). Neyman's attitude 
towards Wald's (1950) framing was also adopted wholeheartedly by some of his 
most influential students and colleagues at Berkeley, including Lehmann (1959) 
and LeCam (1986). In the foreword to the collection of Neyman's early papers
published in 1966, Neyman's students involved in selecting his papers to be 
reprinted write: 

"The concepts of confidence intervals and of the Neyman-Pearson the- 
ory have proved immensely fruitful. A natural but far reaching extension 
of their scope can be found in Abraham Wald's theory of statistical de- 
cision functions." (Neyman, 1966, p. vii) [emphasis added] 

In contrast, R. A. Fisher (1955) rejected the decision-theoretic perspective, 
claiming that it seriously distorts his rendering of frequentist statistics: 

"The attempt to reinterpret the common tests of significance used in 
scientific research as though they constituted some kind of acceptance 
procedure and led to 'decisions' in Wald's sense, originated in several 
misapprehensions and has led, apparently, to several more." (p. 69) 

The primary aim of this paper is to take a closer look at the decision-theoretic 
framing in order to evaluate the extent to which it provides an appropriate 
framework for comparing the frequentist and Bayesian approaches. It is argued 
that the decision-theoretic terminology only glosses over the fundamental differ- 
ences in the underlying reasoning of the two approaches and gives rise to several 
misleading interpretations and conclusions. In a nutshell, frequentist inference 
is germane to scientific inference and the decision theoretic framing is germane 
to what Fisher (1955) called "acceptance sampling", where the loss function 
emanates from information 'other than the data'. 
The paper argues that the decision theoretic perspective misrepresents the
frequentist viewpoint for two interrelated reasons: (a) the decision theoretic
framing is at odds with both the primary objective and the reasoning underlying
frequentist inference, and (b) the notions of a risk function and admissibility are
inappropriate for frequentist inference because they do not represent legitimate
error probabilities. The primary objective of frequentist inference is to learn
from data $\mathbf{x}_0$ about the 'true' generating mechanism, described in terms of a
particular (true) value $\theta^{*}$ of $\theta$, as it relates to the underlying statistical model
$\mathcal{M}_{\theta}(\mathbf{x})$, $\theta\in\Theta$; $\theta$ denotes the unknown parameter(s) and $\Theta$ the parameter space.
The reasoning underlying frequentist inference takes two different forms, factual
(under the true state of nature) and hypothetical (what if $\theta^{*}$ is equal to $\theta_0$). In
contrast, the notions of a loss function and admissibility do not depend on $\theta^{*}$,
but are concerned with all possible values of $\theta\in\Theta$. Conflating the two has led
to numerous misinterpretations in the statistical literature, including mixing up
the expected loss and MSE with legitimate error probabilities, ignoring the
fact that the latter are always attached to the inference procedures themselves,
but the former to all values of $\theta\in\Theta$. This confusion also undermines a widely
held standpoint that the way to generate good statistical procedures is to find
the Bayes solution to an inference problem using a 'reasonable' prior and then
examine its frequentist properties to see whether it is satisfactory from the latter
viewpoint; see Rubin (1984), Gelman et al. (2004).

2 Decision-theoretic set up 

It is generally accepted that the decision-theoretic framework has four basic 
elements: 

"1. The space A of actions available to the statistician. 

2. The space $\Theta$ of states of the world, or states of nature. One of these
is the "true" state, but the statistician does not know which one. The
space $\Theta$ is also called the parameter space.

3. The loss function $L(\theta,a)$, representing the numerical loss to the
statistician if he takes action $a\in A$, when the true state of nature is
$\theta\in\Theta$.

4. An experiment yielding observations X, the distribution of which 
depends on the true state of nature, and which hopefully will help the 
statistician to reduce his loss." (Ferguson, 1976, p. 336) 

The frequentist, Bayesian and the decision-theoretic approaches share the 
notion of a (parametric) statistical model, stemming from elements 2 and 
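4. That is, all three approaches begin by viewing data $\mathbf{x}:=(x_1,\ldots,x_n)$ as a
realization of a sample $\mathbf{X}:=(X_1,\ldots,X_n)$ from a prespecified statistical model
$\mathcal{M}_{\theta}(\mathbf{x})$, generically specified by:

$$\mathcal{M}_{\theta}(\mathbf{x})=\{f(\mathbf{x};\theta),\ \theta\in\Theta\},\ \mathbf{x}\in\mathbb{R}_X^n,\ \text{for } \theta\in\Theta\subset\mathbb{R}^{m},\ m<n, \qquad (1)$$

where $f(\mathbf{x};\theta)$ denotes the (joint) distribution of the sample $\mathbf{X}$.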
In a decision-theoretic framework the loss function $L(\theta,\widehat{\theta}(\mathbf{X}))$ can take several
functional forms (table 1); see Wasserman, 2004, p. 193. The key differences
between the three approaches are that:

(a) the frequentist approach relies exclusively on $\mathcal{M}_{\theta}(\mathbf{x})$,

(b) the decision-theoretic framing allows for an action (decision) space that
can be different from $\Theta$ and adds a loss (or utility) function $L(\theta,h(\mathbf{x}))$,
for all $\theta\in\Theta$ and $\mathbf{x}\in\mathbb{R}_X^n$, and

(c) the Bayesian approach adds a prior distribution: $\pi(\theta)$, for all $\theta\in\Theta$.



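Table 1: Decision-theoretic loss functions

Square loss:       $L_2(\widehat{\theta}(\mathbf{X});\theta)=(\widehat{\theta}(\mathbf{X})-\theta)^{2}$

Absolute loss:     $L_1(\widehat{\theta}(\mathbf{X});\theta)=|\widehat{\theta}(\mathbf{X})-\theta|$

$L_p$ loss:        $L_p(\widehat{\theta}(\mathbf{X});\theta)=|\widehat{\theta}(\mathbf{X})-\theta|^{p}$

Zero-one loss:     $L_{0-1}(\theta,\widehat{\theta}(\mathbf{X}))=0$ if $\widehat{\theta}(\mathbf{X})=\theta$, and $1$ if $\widehat{\theta}(\mathbf{X})\neq\theta$

Kullback-Leibler:  $L_{KL}(\widehat{\theta}(\mathbf{X});\theta)=\int_{\mathbf{x}\in\mathbb{R}_X^n}\ln\left(\frac{f(\mathbf{x};\theta)}{f(\mathbf{x};\widehat{\theta})}\right)f(\mathbf{x};\theta)d\mathbf{x}$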

The apparent accommodation of both the frequentist and Bayesian ap- 
proaches stems from the fact that the loss function depends on both the sample 
and parameter spaces via the two quantifiers 'for all $\theta\in\Theta$ and all $\mathbf{x}\in\mathbb{R}_X^n$'. The
quantifier 'for all $\mathbf{x}\in\mathbb{R}_X^n$' is deemed to create an affinity with the frequentist
approach. The universal quantifier 'for all $\theta\in\Theta$' creates a similar affinity with the
Bayesian approach because it is a key component of the posterior distribution:

$$\pi(\theta|\mathbf{x}_0)\propto\pi(\theta)\cdot f(\mathbf{x}_0|\theta),\ \text{for all } \theta\in\Theta,$$

on the basis of which Bayesian inferences are framed. Indeed, the affinity 
between the decision-theoretic and Bayesian perspectives does not end there. 
When Bayesians claim that all the relevant information for any inference
concerning $\theta$ is given by $\pi(\theta|\mathbf{x}_0)$ they only admit to half the truth. The other half
is that for selecting a Bayesian 'optimal' estimator of $\theta$ one needs to invoke
additional information like a loss (or utility) function $L(\widehat{\theta}(\mathbf{X}),\theta)$. An appropriate
Bayes estimator is usually selected by minimizing the posterior risk:

$$R_{\pi}(\widehat{\theta},\theta)=\int_{\theta\in\Theta}L(\widehat{\theta}(\mathbf{X}),\theta)\pi(\theta|\mathbf{x}_0)d\theta.$$

The loss function plays a crucial role in selecting an optimal estimator because:

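(i) when $L_2(\widehat{\theta},\theta)=(\widehat{\theta}-\theta)^{2}$ the Bayes estimator $\widehat{\theta}$ is the mean of $\pi(\theta|\mathbf{x}_0)$,

(ii) when $L_1(\widehat{\theta},\theta)=|\widehat{\theta}-\theta|$ the Bayes estimator $\widehat{\theta}$ is the median of $\pi(\theta|\mathbf{x}_0)$,

(iii) when $L_{0-1}(\widehat{\theta},\theta)=\delta(\widehat{\theta},\theta)$, where $\delta(\widehat{\theta},\theta)=0$ for $\widehat{\theta}=\theta$ and
$\delta(\widehat{\theta},\theta)=1$ for $\widehat{\theta}\neq\theta$, the Bayes estimator
$\widehat{\theta}$ is the mode of $\pi(\theta|\mathbf{x}_0)$; note that for purely mathematical reasons $\delta(.)$ is often
written:

$$\delta(\widehat{\theta},\theta)=\begin{cases}0 & \text{for } |\widehat{\theta}-\theta|<\varepsilon\\ 1 & \text{for } |\widehat{\theta}-\theta|\geq\varepsilon\end{cases}$$

for some small $\varepsilon>0$; see Schervish (1995).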
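The correspondence in (i)-(iii) between the loss function and the resulting Bayes estimator can be checked numerically. The following is only a minimal sketch under stated assumptions: the Beta(3, 7) posterior, the grid size and the value of $\varepsilon$ are hypothetical choices made for illustration (none of them comes from the paper), and all function names are invented for the example. It minimizes the posterior risk over a grid of candidate estimates and compares the minimizers with the posterior mean, median and mode.

```python
def beta_grid_posterior(a, b, size=400):
    """Discretised Beta(a, b) posterior on (0, 1): grid points and weights summing to 1."""
    grid = [(i + 0.5) / size for i in range(size)]
    dens = [t ** (a - 1) * (1 - t) ** (b - 1) for t in grid]
    total = sum(dens)
    return grid, [d / total for d in dens]


def posterior_risk(action, grid, weights, loss):
    """Posterior expected loss of a candidate estimate (action)."""
    return sum(w * loss(action, t) for t, w in zip(grid, weights))


def bayes_estimate(grid, weights, loss):
    """Grid point that minimises the posterior risk under the given loss."""
    return min(grid, key=lambda a: posterior_risk(a, grid, weights, loss))


def square_loss(a, t):
    return (a - t) ** 2


def absolute_loss(a, t):
    return abs(a - t)


def zero_one_loss(a, t, eps=0.011):
    # epsilon-ball version of the zero-one loss (eps is a hypothetical choice)
    return 0.0 if abs(a - t) < eps else 1.0


# Hypothetical posterior, chosen only for illustration.
grid, weights = beta_grid_posterior(3, 7)

# Posterior mean, median and mode computed directly from the grid.
post_mean = sum(w * t for t, w in zip(grid, weights))
cum = 0.0
for t, w in zip(grid, weights):
    cum += w
    if cum >= 0.5:
        post_median = t
        break
post_mode = grid[weights.index(max(weights))]

# Minimisers of the posterior risk under the three loss functions.
est_square = bayes_estimate(grid, weights, square_loss)
est_absolute = bayes_estimate(grid, weights, absolute_loss)
est_zero_one = bayes_estimate(grid, weights, zero_one_loss)

print(f"square loss:   {est_square:.4f}  (posterior mean   {post_mean:.4f})")
print(f"absolute loss: {est_absolute:.4f}  (posterior median {post_median:.4f})")
print(f"zero-one loss: {est_zero_one:.4f}  (posterior mode   {post_mode:.4f})")
```

Up to the resolution of the grid, the three minimizers agree with the mean, median and mode respectively, which makes concrete the point that the choice among them is driven by the loss function and not by the posterior alone.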

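To render the notion of a loss function operational one needs to deal with
the two quantifiers 'for all $\theta\in\Theta$' and 'for all $\mathbf{x}\in\mathbb{R}_X^n$'. To eliminate the latter
quantifier the decision-theoretic approach takes expectations with respect to
$f(\mathbf{x};\theta)$ for all $\mathbf{x}\in\mathbb{R}_X^n$ to reduce it to a single number, the mean of the loss
function $L(\theta,\widehat{\theta})$, known as the risk function:

$$R(\theta,\widehat{\theta})=E_{\mathbf{x}}[L(\theta,\widehat{\theta}(\mathbf{X}))]=\int_{\mathbf{x}\in\mathbb{R}_X^n}L(\theta,\widehat{\theta}(\mathbf{x}))f(\mathbf{x};\theta)d\mathbf{x},\ \text{for all } \theta\in\Theta, \qquad (2)$$

which is now only a function of $\theta\in\Theta$. In practice, the most widely used loss
function is the square, whose risk function is known as the Mean Square Error
(MSE):

$$R(\theta,\widehat{\theta})=MSE(\widehat{\theta}(\mathbf{X});\theta)=E(\widehat{\theta}(\mathbf{X})-\theta)^{2},\ \text{for all } \theta\in\Theta. \qquad (3)$$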

From a decision-theoretic perspective a minimal property for an estimator
is considered to be admissibility. An estimator $\widehat{\theta}(\mathbf{X})$ is inadmissible if there
exists another estimator $\widetilde{\theta}(\mathbf{X})$ such that:

$$R(\theta,\widetilde{\theta})\leq R(\theta,\widehat{\theta})\ \text{for all } \theta\in\Theta, \qquad (4)$$

and the strict inequality ($<$) holds for at least one value of $\theta$. Otherwise, $\widehat{\theta}(\mathbf{X})$
is said to be admissible with respect to the loss function $L(\theta,\widehat{\theta})$.
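A one-line worked example may help fix the definition. It is offered only as an illustration and assumes a simple Normal model with unit variance, $X_k\sim\text{NIID}(\theta,1)$, $k=1,\ldots,n$, together with the square loss. Comparing the single observation $X_1$ with the sample mean $\overline{X}_n$ gives:

```latex
\begin{aligned}
R(\theta, X_1) &= E(X_1-\theta)^{2} = Var(X_1) = 1, \\
R(\theta, \overline{X}_n) &= E(\overline{X}_n-\theta)^{2} = Var(\overline{X}_n) = \tfrac{1}{n}, \\
R(\theta, \overline{X}_n) &< R(\theta, X_1) \quad \text{for all } \theta\in\mathbb{R} \text{ and } n>1 .
\end{aligned}
```

Under these assumptions $X_1$ is inadmissible because it is dominated by $\overline{X}_n$ at every value of $\theta$. Note that the comparison is carried out over the whole of the parameter space.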

Having eliminated the quantifier 'for all $\mathbf{x}\in\mathbb{R}_X^n$', one needs to deal with the
quantifier 'for all $\theta\in\Theta$'. This is because risk functions often intersect, rendering
one estimator better than another for certain values of $\theta\in\Theta_1\subset\Theta$, but worse for
other values $\theta\in\Theta-\Theta_1$. The two most widely used such reductions are:

Maximum risk: $R_{\max}(\widehat{\theta})=\sup_{\theta\in\Theta}R(\theta,\widehat{\theta})$

Bayes risk: $R_B(\widehat{\theta})=\int_{\theta\in\Theta}R(\theta,\widehat{\theta})\pi(\theta)d\theta$

where $\pi(\theta)$ denotes the prior distribution of $\theta$.

Having reduced the risk function from 'all $\theta\in\Theta$' down to a scalar, the obvious
way to choose among different estimators is to find the ones that minimize this
scalar with respect to all possible estimators $\widetilde{\theta}(.):\mathbb{R}_X^n\rightarrow\Theta$.
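Such a minimization gives rise to the two widely used decision rules:

Minimax rule: $\inf_{\widehat{\theta}(\mathbf{X})}R_{\max}(\widehat{\theta})=\inf_{\widehat{\theta}(\mathbf{X})}\left[\sup_{\theta\in\Theta}R(\theta,\widehat{\theta})\right]$

Bayes rule: $\inf_{\widehat{\theta}(\mathbf{X})}R_B(\widehat{\theta})=\inf_{\widehat{\theta}(\mathbf{X})}\int_{\theta\in\Theta}R(\theta,\widehat{\theta})\pi(\theta)d\theta$

where the infimum is over all possible estimators $\widehat{\theta}(\mathbf{X})$.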
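To see how the two reductions operate when risk functions intersect, consider a standard textbook illustration that is not taken from the paper: $X\sim\text{Binomial}(n,p)$ with square loss, comparing the MLE $X/n$ with the shrunken estimator $(X+\sqrt{n}/2)/(n+\sqrt{n})$, whose risk is constant in $p$. The sketch below assumes $n=100$ and a uniform prior on $[0,1]$; both are hypothetical choices, and the function names are invented for the example.

```python
from math import comb, sqrt


def risk(estimator, p, n):
    """Exact square-loss risk E[(estimator(X) - p)^2] for X ~ Binomial(n, p)."""
    return sum(
        comb(n, x) * p ** x * (1 - p) ** (n - x) * (estimator(x, n) - p) ** 2
        for x in range(n + 1)
    )


def mle(x, n):
    return x / n


def shrunk(x, n):
    # Constant-risk estimator (X + sqrt(n)/2) / (n + sqrt(n)).
    return (x + sqrt(n) / 2) / (n + sqrt(n))


def max_risk(estimator, n, grid):
    """Largest risk over the grid of parameter values."""
    return max(risk(estimator, p, n) for p in grid)


def bayes_risk(estimator, n, grid):
    """Trapezoid-rule integral of the risk against a uniform prior on [0, 1]."""
    values = [risk(estimator, p, n) for p in grid]
    step = grid[1] - grid[0]
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


n = 100                                   # hypothetical sample size
grid = [i / 200 for i in range(201)]      # parameter values 0, 0.005, ..., 1

summary = {
    name: (max_risk(est, n, grid), bayes_risk(est, n, grid))
    for name, est in (("mle", mle), ("shrunk", shrunk))
}
for name, (worst, average) in summary.items():
    print(f"{name:7s} max risk = {worst:.6f}  Bayes risk (uniform prior) = {average:.6f}")
```

Under these assumptions the maximum-risk criterion ranks the shrunken estimator first, while the Bayes risk under the uniform prior ranks the MLE first. The ranking therefore depends on how the quantifier 'for all values of the parameter' is collapsed to a scalar.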
Taking admissibility as the criterion for choosing among estimators, the main
result concerns a Bayes rule $\widehat{\theta}_B(\mathbf{X})$ based on a prior $\pi(\theta)$. The result is that
under the following regularity conditions:

[i] $R(\theta,\widehat{\theta})$ is a continuous function of $\theta$ for every estimator $\widehat{\theta}(\mathbf{X})$,

[ii] $R(\theta,\widehat{\theta}_B)<\infty$,

[iii] $\pi(\theta)$ has full support in the sense that $\int_{\theta-\varepsilon}^{\theta+\varepsilon}\pi(\theta)d\theta>0$ for all $\theta\in\Theta$ and
$\varepsilon>0$,

the Bayes rule $\widehat{\theta}_B(\mathbf{X})$, based on a prior $\pi(\theta)$, is admissible.

The main result relating the two types of optimal estimators (decision rules)
is that a Bayes rule $\widehat{\theta}_B(\mathbf{X})$, corresponding to some prior $\pi(\theta)$, is minimax
if:

[a] $R(\theta,\widehat{\theta}_B)=c<\infty$, i.e. its risk function is constant, and

[b] $\widehat{\theta}_B(\mathbf{X})$ is admissible; see Wasserman (2004), p. 203.

Taken together the above results have led to the following widely accepted 
standpoint in statistics. 

Decision-theoretic/Bayesian claim. The decision-theoretic framing strongly 
suggests that the way to generate good (optimal) statistical procedures is to find 
the Bayes solution using a reasonable prior and then examine its frequentist 
properties to see whether it is satisfactory from the latter viewpoint. 

The quintessential example that has bolstered the appeal of the above claim 
is the James-Stein estimator (Efron and Morris, 1973), that gave rise to a sizeable
literature on shrinkage estimators; see Saleh (2006).

3 Stein's paradox 

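Consider the case of an independent sample $\mathbf{X}:=(X_1,X_2,\ldots,X_m)$ from a Normal
distribution:

$$X_k\sim\text{NI}(\theta_k,\sigma^{2}),\ k=1,2,\ldots,m,$$

where $\sigma^{2}$ is known. Using the notation $\theta:=(\theta_1,\theta_2,\ldots,\theta_m)$ and $\mathbf{I}_m:=\text{diag}(1,1,\ldots,1)$,
this can be denoted by:

$$\mathbf{X}\sim\text{N}(\theta,\sigma^{2}\mathbf{I}_m).$$

The primary aim is to find a good estimator $\widehat{\theta}(\mathbf{X})$ of $\theta$ where its 'optimality' is
assessed in terms of the square (Euclidean) loss function:

$$L_2(\theta,\widehat{\theta}(\mathbf{X}))=\|\widehat{\theta}(\mathbf{X})-\theta\|^{2}=\sum_{k=1}^{m}(\widehat{\theta}_k(\mathbf{X})-\theta_k)^{2}. \qquad (5)$$

Stein (1955) astounded the statistical world by showing that for $m=2$ the Least-Squares
estimator $\widehat{\theta}_{LS}(\mathbf{X})=\mathbf{X}$ is admissible, but for $m>2$ it is inadmissible.
Indeed, James and Stein (1961) were able to come up with a nonlinear estimator:

$$\widehat{\theta}_{JS}(\mathbf{X})=\left(1-\frac{(m-2)\sigma^{2}}{\|\mathbf{X}\|^{2}}\right)\mathbf{X},$$

referred to as the James-Stein estimator, that dominates $\widehat{\theta}_{LS}(\mathbf{X})=\mathbf{X}$ in terms
of the MSE criterion:

$$MSE(\widehat{\theta}(\mathbf{X});\theta)=E\left(L_2(\theta,\widehat{\theta}(\mathbf{X}))\right),\ \text{for all } \theta\in\mathbb{R}^{m},$$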
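by demonstrating that:

$$MSE(\widehat{\theta}_{JS}(\mathbf{X});\theta)<MSE(\widehat{\theta}_{LS}(\mathbf{X});\theta),\ \text{for all } \theta\in\mathbb{R}^{m}. \qquad (6)$$

It turns out that $\widehat{\theta}_{JS}(\mathbf{X})$ is also inadmissible and dominated by the modified
(positive-part) James-Stein estimator:

$$\widehat{\theta}_{JS}^{+}(\mathbf{X})=\left(1-\frac{(m-2)\sigma^{2}}{\|\mathbf{X}\|^{2}}\right)^{+}\mathbf{X},$$

where $(z)^{+}=\max(0,z)$; see Wasserman (2004).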
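The dominance result in (6) can be illustrated by simulation. The following is a rough Monte Carlo sketch, not a proof: it assumes $\sigma^{2}=1$, $m=10$ and a hypothetical vector of true means with every component equal to 1, and the function names are invented for the example.

```python
import random


def james_stein(x, sigma2=1.0):
    """James-Stein shrinkage of the observation vector x towards the origin."""
    m = len(x)
    norm2 = sum(v * v for v in x)
    factor = 1.0 - (m - 2) * sigma2 / norm2
    return [factor * v for v in x]


def squared_error(estimate, theta):
    """Overall squared (Euclidean) loss between an estimate and theta."""
    return sum((e - t) ** 2 for e, t in zip(estimate, theta))


def simulate(theta, reps=5000, seed=12345):
    """Monte Carlo estimates of the overall MSE of X and of the James-Stein estimator."""
    rng = random.Random(seed)
    total_ls = total_js = 0.0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]   # one observation per unknown mean
        total_ls += squared_error(x, theta)
        total_js += squared_error(james_stein(x), theta)
    return total_ls / reps, total_js / reps


theta = [1.0] * 10  # hypothetical true means, m = 10
mse_ls, mse_js = simulate(theta)
print(f"overall MSE, least squares: {mse_ls:.3f}")
print(f"overall MSE, James-Stein:   {mse_js:.3f}")
```

The simulated overall MSE of the least-squares estimator is close to $m=10$, while that of the James-Stein estimator is smaller. The comparison concerns the sum of squared errors across all $m$ components, not the error in estimating any individual mean.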

The traditional interpretation of this result is that when the means $\theta:=(\theta_1,\theta_2,\ldots,\theta_m)$,
for $m>2$, of a Normal, Independent sample $\mathbf{X}$ are the unknown parameters
of interest, the James-Stein estimator reduces their overall MSE by using a
combined nonlinear estimator as opposed to the linear Least-Squares estimator,
which is inadmissible. In contrast, when each parameter is estimated separately,
the least squares (LS) estimator is admissible. This result seems to imply that
one will 'do better' (in expected loss terms) by using a combined nonlinear 
(shrinkage) estimator, instead of estimating these means separately. What is 
surprising about this result is that there is no statistical reason to connect the 
inferences pertaining to the different individual means, and yet the obvious es- 
timator (LS) is inadmissible. As argued below, contrary to the conventional 
wisdom this calls into question the appropriateness of the notions of a loss func- 
tion and admissibility, and not the judiciousness of frequentist estimation. 

4 Risk functions and acceptance sampling 

Despite the apparent affinity between the decision-theoretic set up and the 
Neyman-Pearson (N-P) 'accept/reject' rules, a closer look reveals that it is actually
at odds with the primary objective and the inductive reasoning underlying
frequentist inference in general and N-P testing in particular. 

4.1 Where do loss functions come from? 

A closer scrutiny of the decision-theoretic set up reveals that the loss function 
needs to invoke 'information from sources other than the data', which is usually 
not readily available. Indeed, such information is available in very restrictive 
situations, such as acceptance sampling in quality control. In light of that, 
a proper understanding of the intended scope of statistical inference calls for 
distinguishing the special cases where the loss function is part and parcel of the 
available substantive information from those where no such information is either
relevant or available.

As Fisher (1935) warned several decades ago: 

"In the field of pure research no assessment of the cost of wrong conclu- 
sions, or of delay in arriving at more correct conclusions can conceivably 
be more than a pretence, and in any case such an assessment would 
be inadmissible and irrelevant in judging the state of the scientific evi- 
dence." (pp. 25-26) 

More recently, Tiao and Box (1975), p. 624, reiterated Fisher's distinction: 
"Now it is undoubtedly true that on the one hand that situations exist
where the loss function is at least approximately known (for example 
certain problems in business) and sampling inspection are of this sort. 
... On the other hand, a vast number of inferential problems occur, 
particularly in the analysis of scientific data, where there is no way of 
knowing in advance to what use the results of research will subsequently 
be put." 

Cox (1978), p. 45, went further and questioned this framing even in cases 
where the inference might involve a decision: 

"The reasons that the detailed techniques [of the decision-theoretic 
approach] seem of fairly limited applicability, even when a fairly clearcut 
decision element is involved, may be 

(i) that, except in such fields as control theory and acceptance sam- 
pling, a major contribution of statistical technique is in presenting the 
evidence in incisive form for discussion, rather than in providing me- 
chanical presentation for the final decision. This is especially the case 
when a single major decision is involved. 

(ii) The central difficulty may be in formulating the elements required 
for the quantitative analysis, rather than in combining these elements 
via a decision rule." 

Even current textbooks framed around the decision-theoretic set up admit 
the difficulty of specifying a loss function: 

"The actual determination of the loss function is often awkward in 
practice, in particular because the determination of the consequences 
of each action for each value of $\theta$ is usually impossible when $\mathcal{D}$ or $\Theta$ are
large sets, for instance when they have an infinite number of elements." 
(p. 52) 

Indeed, the determination of the loss function is always awkward since both
sets $\mathcal{D}$ and $\Theta$ are usually infinite. Moreover, when one focuses on the analysis of
scientific data, the use of loss functions can give rise to misleading impressions 
of affinity, similarity and analogy. For instance, it is widely accepted that the 
expected loss (risk) represents a genuine frequentist error analogous to the type 
I and II error probabilities:

"The loss function is supposed to evaluate the penalty (or error) $L(\theta,d)$
associated with the decision $d$ [in $\mathcal{D}$] when the parameter takes value $\theta$
[in $\Theta$]." (Robert, 2007, p. 52)

In what follows it is argued that such claims are misinformed. It is not obvious
why a loss function like the $MSE(\widehat{\theta}(\mathbf{X});\theta)$ evaluates 'errors' associated with
the inherent capacity of an estimator $\widehat{\theta}(\mathbf{X})$ to pin-point the true $\theta$. The
discussion demonstrates that the decision-theoretic framing has a lot more affinity
with the Bayesian perspective than it seems at first sight, and some of Fisher's 
qualms are well-grounded. 
4.2 'Nuts and bolts' vs. learning from data

Let us bring out the key features of a situation where the above decision- 
theoretic set up makes perfectly good sense. This is the situation Fisher (1955) 
called acceptance sampling, such as an industrial production process where 
the objective is quality control, i.e. to make a decision pertaining to shipping 
sub-standard products (e.g. nuts and bolts) to a buyer using the expected 
loss/gain as the ultimate criterion. In such a context the $MSE(\widehat{\theta}(\mathbf{X});\theta)$, or some
other risk function, is relevant because it evaluates genuine losses associated
with a decision related to the choice of an estimate $\widehat{\theta}(\mathbf{x}_0)$, say the cost of the
observed percentage of defective products, but that has nothing to do with type
I and II error probabilities.

Acceptance sampling differs from the usual scientific context in two crucial 
respects: 

[a] The primary aim is to use statistical rules to guide actions astutely, e.g.
use $\widehat{\theta}(\mathbf{x}_0)$ in order to minimize the expected loss associated with "a decision",
and

[b] The sagacity of all actions is determined by the respective 'losses' stemming
from "relevant information other than the data" (Cox and Hinkley, 1974,
p. 251).

The key difference between acceptance sampling and a scientific inquiry is 
that the primary objective of the latter is not to minimize expected loss (costs, 
utility) associated with different values of $\theta\in\Theta$, but to use data $\mathbf{x}_0$ to learn
about the 'true' model:

$$\mathcal{M}^{*}(\mathbf{x})=\{f(\mathbf{x};\theta^{*})\},\ \mathbf{x}\in\mathbb{R}_X^n, \qquad (7)$$

where $\theta^{*}$ denotes the true value of $\theta$ in $\Theta$, whatever that happens to be. The
two situations are drastically different mainly because the key notion of a 'true
$\theta$' calls into question the above acceptance sampling set up. Indeed, the loss
function, being defined for all $\theta\in\Theta$, will usually penalize $\theta^{*}$, and there is no
reason to believe that the $\theta$ ranked lowest when minimizing the expected loss
would coincide with $\theta^{*}$, unless by accident.

Consider the case where acceptance sampling resembles hypothesis testing in 
so far as final products are randomly selected for inspection during the produc- 
tion process. In such a situation the main objective can be viewed as operational- 
izing the probabilities of false acceptance/rejection with a view to minimize the 
expected losses. The conventional wisdom has been that this situation is similar 
enough to Neyman-Pearson (N-P) testing to render the latter as the appropriate 
framing for the decision to ship this particular batch or not. However, a closer 
look at some of the examples used to illustrate such a situation (Silvey, 1975), 
reveals that the decisions are driven exclusively by the risk function and not by 
any aspiration to learn from data about the true $\theta^{*}$. For instance, the N-P way of
addressing the trade-off between the two types of error probabilities, fixing $\alpha$ at
a small value and seeking a test that minimizes the type II error probability, seems
utterly irrelevant in such a context. One can easily think of a loss function
where the 'optimal' trade-off calls for a much larger type I than type II error
probability. That is, in acceptance sampling:

[c] The trade-off between the two types of error probabilities is determined
by the risk function itself, and not by any attempt to learn from data about $\theta^{*}$.

In light of the crucial differences [a]-[c], one can make a strong case that the 
objectives and the underlying reasoning of acceptance sampling are drastically 
different from those pertaining to a scientific context. 
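Point [c] can be made concrete with a small numerical sketch. Everything in it is hypothetical and chosen for illustration only: a 'good' batch has defect rate 0.02 and a 'bad' batch 0.10, $n=50$ items are inspected, the batch is rejected when more than $c$ defectives are found, rejecting a good batch costs 1 unit, shipping a bad batch costs 20 units, and the two state-specific expected losses are simply added. Type I error is taken to mean rejecting a good batch and type II shipping a bad one.

```python
from math import comb


def binom_cdf(c, n, p):
    """P(X <= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(c + 1))


def error_probabilities(c, n, p_good, p_bad):
    """Rule: reject the batch when more than c of the n inspected items are defective."""
    alpha = 1.0 - binom_cdf(c, n, p_good)  # type I: reject a good batch
    beta = binom_cdf(c, n, p_bad)          # type II: ship a bad batch
    return alpha, beta


def expected_loss(c, n, p_good, p_bad, cost_reject_good, cost_ship_bad):
    """Sum of the two state-specific expected losses (equal weighting is an assumption)."""
    alpha, beta = error_probabilities(c, n, p_good, p_bad)
    return cost_reject_good * alpha + cost_ship_bad * beta


# All numbers below are hypothetical.
n, p_good, p_bad = 50, 0.02, 0.10
cost_reject_good, cost_ship_bad = 1.0, 20.0

best_c = min(
    range(n + 1),
    key=lambda c: expected_loss(c, n, p_good, p_bad, cost_reject_good, cost_ship_bad),
)
alpha, beta = error_probabilities(best_c, n, p_good, p_bad)
print(f"loss-minimising cutoff c = {best_c}: type I = {alpha:.3f}, type II = {beta:.3f}")
```

With these costs the loss-minimizing cutoff is $c=0$, giving a type I error probability of roughly 0.64 against a type II error probability of roughly 0.005. The trade-off is dictated entirely by the assumed costs, and changing them moves the cutoff accordingly.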

4.3 Loss function vs. inherent distance function 

The notion of a loss function stemming from 'information other than the data' 
raises another source of potential conflict. This emanates from the fact that 
within each statistical model $\mathcal{M}_{\theta}(\mathbf{x})$ in (1) there exists an inherent statistical
distance function, often relating to the score function, and thus based on information
contained in the data; see Casella and Berger (2002).

It is well-known that when the distribution underlying $\mathcal{M}_{\theta}(\mathbf{x})$ is Normal,
the inherent distance function for comparing estimators of the mean $\theta$ is the
square:

$$ND(\widehat{\theta}_n(\mathbf{X});\theta^{*})=(\widehat{\theta}_n(\mathbf{X})-\theta^{*})^{2},$$

evaluated at $\theta=\theta^{*}$, the 'true' $\theta$ in $\Theta$. On the other hand, when the distribution is
Laplace (see Shao, 2003) the relevant statistical distance function is the Absolute
Distance:

$$AD(\widehat{\theta}_n(\mathbf{X});\theta^{*})=|\widehat{\theta}_n(\mathbf{X})-\theta^{*}|.$$

Similarly, when the distribution underlying $\mathcal{M}_{\theta}(\mathbf{x})$ is Uniform, the inherent
distance function is:

$$SUP(\widehat{\theta}_n(\mathbf{X});\theta^{*})=\sup|\widehat{\theta}_n(\mathbf{x})-\theta^{*}|.$$

A key feature of all these distance functions is that they are defined in terms of
$\theta^{*}$, the true $\theta$, whatever that value happens to be. In contrast, the traditional
loss functions are defined for all possible values of $\theta\in\Theta$.

The question that naturally arises is when it might make sense to ignore 
these inherent distance functions and compare estimators using an externally 
given loss function stemming from information other than the data. 

5 Frequentist inference and learning from data 

An important dimension of frequentist inference that has not been adequately 
appreciated in the statistics literature concerns its objectives and underlying 
reasoning. As mentioned above, its primary objective is to learn from data 
about the true model $\mathcal{M}^{*}(\mathbf{x})=\{f(\mathbf{x};\theta^{*})\}$, $\mathbf{x}\in\mathbb{R}_X^n$. The underlying reasoning
comes in two alternative forms. For estimation and prediction, the reasoning is 
factual, but for hypothesis testing it is hypothetical. Let us elaborate on these 
issues. 
5.1 Frequentist estimation

The nature of frequentist reasoning underlying estimation is factual, in the
sense that the optimality of an estimator (its generic capacity to zero in on $\theta^{*}$) is
appraised in terms of its sampling distribution evaluated under the True State
of Nature (TSN), i.e.

TSN: $\theta=\theta^{*}$, whatever $\theta^{*}\in\Theta$ happens to be.

The primary objective of frequentist inference, in general, is to learn from data
$\mathbf{x}_0$ about the 'true' statistical data generating mechanism (7). Point estimators
contribute to this objective by effectively pin-pointing $\theta^{*}$ for all sample realizations.
Indeed, optimal properties like consistency, unbiasedness, full efficiency,
sufficiency, etc. evaluate the generic capacity of $\widehat{\theta}_n(\mathbf{X})$ to zero in on $\theta^{*}$. Its
effectiveness for different sample realizations $\mathbf{x}\in\mathbb{R}_X^n$ is measured by its sampling
distribution:

$$f(\widehat{\theta}_n(\mathbf{x});\theta^{*}),\ \text{for } \mathbf{x}\in\mathbb{R}_X^n.$$

A key feature of frequentist inference is that the sampling distribution of any
statistic $Y_n=g(\mathbf{X})$ (estimator, test, predictor) is derived via:

$$F(y;\theta):=P(Y_n\leq y;\theta)=\int\!\!\int\cdots\int_{\{\mathbf{x}:\ g(\mathbf{x})\leq y;\ \mathbf{x}\in\mathbb{R}_X^n\}}f(\mathbf{x};\theta)d\mathbf{x}. \qquad (8)$$

Hence, the sampling distribution $f(\widehat{\theta}_n(\mathbf{x});\theta^{*})$ of an estimator $\widehat{\theta}_n(\mathbf{X})$ is derived
by integrating $f(\mathbf{x};\theta^{*})$, i.e. evaluated under $\theta=\theta^{*}$. In contrast, in hypothesis
testing the sampling distribution of a test statistic $d(\mathbf{X})$ is derived via (8) by
integrating $f(\mathbf{x};\theta)$ where $\theta$ is given different hypothetical values under both the
null and alternative hypotheses.

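For instance, strong consistency asserts that $\widehat{\theta}_n(\mathbf{X})$ will zero in on $\theta^{*}$ with
probability one as $n\rightarrow\infty$:

$$P(\lim_{n\rightarrow\infty}\widehat{\theta}_n(\mathbf{X})=\theta^{*})=1.$$

That is, for a 'large enough' $n$, $\widehat{\theta}_n(\mathbf{X})$ will pin-point $\theta^{*}$ almost surely. Similarly,
unbiasedness asserts that the sampling distribution of $\widehat{\theta}_n(\mathbf{X})$ has a mean equal
to $\theta^{*}$:

$$E(\widehat{\theta}_n(\mathbf{X}))=\theta^{*}.$$

In this sense both of these optimal properties are defined at the point $\theta=\theta^{*}$,
and not 'for all $\theta\in\Theta$'. Indeed, defining unbiasedness as:

$$E(\widehat{\theta}_n(\mathbf{X}))=\theta\ \text{for all } \theta\in\Theta,$$

makes no sense in frequentist estimation. What is of interest for a frequentist
is whether the sampling distribution of $\widehat{\theta}_n(\mathbf{X})$ has a mean equal to the true
$\theta^{*}$ or not. Similarly, the appropriate frequentist MSE of an
estimator is defined at the point $\theta=\theta^{*}$:

$$MSE(\widehat{\theta}_n(\mathbf{X});\theta^{*})=E(\widehat{\theta}_n(\mathbf{X})-\theta^{*})^{2},\ \text{for a particular } \theta^{*}\in\Theta. \qquad (9)$$

That is, the only point at which the concept of bias makes sense is:

$$Bias(\widehat{\theta}_n(\mathbf{X});\theta^{*})=E(\widehat{\theta}_n(\mathbf{X}))-\theta^{*},\ \text{for a particular } \theta^{*}\in\Theta, \qquad (10)$$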
rendering the decomposition of the MSE defined at $\theta=\theta^{*}$:

$$MSE(\widehat{\theta}_n(\mathbf{X});\theta^{*})=Var(\widehat{\theta}_n(\mathbf{X}))+[E(\widehat{\theta}_n(\mathbf{X}))-\theta^{*}]^{2},\ \text{for a particular } \theta^{*}\in\Theta, \qquad (11)$$

a meaningful measure of the dispersion of the sampling distribution of $\widehat{\theta}_n(\mathbf{X})$
around $\theta=\theta^{*}$. This is because the variance is defined as the variation around
the true mean $\theta^{*}$, whatever value that happens to be! This viewpoint goes back
to Fisher (1920) in his discussion of two different estimators of $\sigma=\sqrt{Var(X)}$
that led him to the property of sufficiency. In contrast, the notion of dispersion
around all possible values of $\theta$ in $\Theta$, like (3), is meaningless for an estimator
aiming to pin-point $\theta^{*}$.
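For completeness, the decomposition of the MSE at $\theta=\theta^{*}$ follows from adding and subtracting the mean of the estimator. In the short derivation below, $\mu_n:=E(\widehat{\theta}_n(\mathbf{X}))$ is shorthand introduced here for convenience, and every expectation is evaluated under $\theta=\theta^{*}$:

```latex
\begin{aligned}
MSE(\widehat{\theta}_n(\mathbf{X});\theta^{*})
 &= E\big[(\widehat{\theta}_n(\mathbf{X})-\mu_n)+(\mu_n-\theta^{*})\big]^{2} \\
 &= E(\widehat{\theta}_n(\mathbf{X})-\mu_n)^{2}
    +2(\mu_n-\theta^{*})\,E(\widehat{\theta}_n(\mathbf{X})-\mu_n)
    +(\mu_n-\theta^{*})^{2} \\
 &= Var(\widehat{\theta}_n(\mathbf{X}))+\big[E(\widehat{\theta}_n(\mathbf{X}))-\theta^{*}\big]^{2},
\end{aligned}
```

where the cross-product term vanishes because $E(\widehat{\theta}_n(\mathbf{X})-\mu_n)=0$.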

The above reasoning has nothing to do with the quantifier 'for all possible
values of $\theta\in\Theta$', despite claims made by numerous textbook writers:

"The frequentist paradigm relies on this criterion [risk function] to com- 
pare estimators and, if possible, to select the best estimator, the reason- 
ing being that estimators are evaluated on their long-run performance 
for all possible values of the parameter $\theta$." (Robert, 2007, p. 61)

In terms of elementary logic, the confusion can be explained as the result of 
conflating two different quantifiers: 

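(a) the universal quantifier, 'for all $\theta\in\Theta$', denoted by $\forall\theta\in\Theta$, and

(b) the existential quantifier, 'there exists a $\theta^{*}\in\Theta$ such that', denoted by
$\exists\theta^{*}\in\Theta$.

This is exemplified by the two different definitions of the MSE:

$$\begin{array}{ll}\text{Decision-theoretic:} & \forall\theta\in\Theta:\ MSE(\widehat{\theta}_n(\mathbf{X});\theta)=E(\widehat{\theta}_n(\mathbf{X})-\theta)^{2}\\ \text{Frequentist:} & \exists\theta^{*}\in\Theta:\ MSE(\widehat{\theta}_n(\mathbf{X});\theta^{*})=E(\widehat{\theta}_n(\mathbf{X})-\theta^{*})^{2}\end{array} \qquad (12)$$

Hence, the apparent affinity between a square loss function and the dispersion of
an estimator is illusory because the only relevant dispersion from the frequentist
perspective is around the true value $\theta^{*}$.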

What is perhaps most surprising is that statistics textbooks adopt one or
the other definition of unbiasedness and the MSE in (12) and ignore (or seem
unaware of) the other. What is less surprising is that Bayesian textbook
writers, like Robert (2007), Berger (1985) and Ghosh et al. (2006), invariably
adopt the definition with the quantifier 'for all $\theta\in\Theta$'.

A closer look at the decision-theoretic setup reveals that it would penalize
the value $\theta=\theta^{*}$. The only loss function that could potentially avoid that problem
is the zero-one function:

$$L_{0-1}(\theta^{*},\widehat{\theta}_n(\mathbf{X}))=\begin{cases}0 & \text{if } \widehat{\theta}_n(\mathbf{X})=\theta^{*}\\ 1 & \text{if } \widehat{\theta}_n(\mathbf{X})\neq\theta^{*}\end{cases} \qquad (13)$$

However, (13) is non-operational in practice because $\theta^{*}$ is the unknown of interest!
To add insult to injury, this is often used as the justification for using the
quantifier 'for all $\theta\in\Theta$'; see Robert (2007). This justification is clearly misinformed
about frequentist inference procedures, whose relevant error probabilities are
ascertainable without any need to know $\theta^{*}$, because they are not attached to
different values of $\theta$, but to the inference procedures themselves.
5.2 Admissibility as a 'minimal' property

The factual nature of frequentist reasoning in estimation also brings out the 
impertinence of the notion of admissibility stemming from its reliance on the 
quantifier 'for all $\theta\in\Theta$'. To see that more clearly let us consider the following
example. 

Example. In the context of the simple Normal model in (14), let us consider
a MSE comparison between two estimators of $\theta$:

(i) the Maximum Likelihood Estimator (MLE): $\overline{X}_n=\frac{1}{n}\sum_{t=1}^{n}X_t$,

(ii) the 'crystal ball' estimator: $\widehat{\theta}_{cb}(\mathbf{x})=7405926$, for all $\mathbf{x}\in\mathbb{R}_X^n$.

It turns out that both estimators are admissible and thus equally acceptable on
admissibility grounds. This surprising result stems primarily from the quantifier
'for all $\theta\in\Theta$'. Indeed, for certain values of $\theta$ close to $\widehat{\theta}_{cb}$, say $\theta\in(\widehat{\theta}_{cb}\pm\frac{\lambda}{\sqrt{n}})$, for
$0<\lambda<1$, the latter is 'better' than $\overline{X}_n$ since:

$$MSE(\overline{X}_n;\theta)=\frac{1}{n}>\frac{\lambda^{2}}{n}>MSE(\widehat{\theta}_{cb};\theta),\ \text{for } \theta\in(\widehat{\theta}_{cb}\pm\tfrac{\lambda}{\sqrt{n}}).$$

Common sense suggests that if a criterion cannot distinguish between $\overline{X}_n$ [a
strongly consistent, unbiased, fully efficient and sufficient estimator] and an
arbitrarily chosen real number that ignores the data altogether, it is practically
useless for distinguishing between 'good' and 'bad' estimators in frequentist
statistics. Moreover, it is obvious that the source of the problem is the
quantifier $\forall\theta\in\Theta$. In contrast to admissibility, the property of consistency instantly
eliminates the crystal ball estimator $\widehat{\theta}_{cb}$.
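The MSE comparison in this example can be reproduced with a few lines of code. The sketch below uses the closed-form risk functions under the simple Normal model with unit variance; the values $n=100$ and $\lambda=0.5$, the small grid of parameter values and the function names are hypothetical choices made for illustration. A check on a finite grid can only refute dominance, it cannot establish admissibility.

```python
THETA_CB = 7405926.0  # the data-ignoring 'crystal ball' value from the example


def mse_sample_mean(theta, n):
    """MSE of the sample mean under NIID(theta, 1): 1/n for every theta."""
    return 1.0 / n


def mse_crystal_ball(theta, n):
    """MSE of the constant estimator: squared distance from theta, whatever n is."""
    return (THETA_CB - theta) ** 2


def dominates(risk_a, risk_b, thetas, n):
    """True if risk_a <= risk_b at every grid value and < at one or more of them."""
    pairs = [(risk_a(t, n), risk_b(t, n)) for t in thetas]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)


n = 100
lam = 0.5  # hypothetical value with 0 < lambda < 1
offsets = (-5.0, -1.0, -lam / n ** 0.5, 0.0, lam / n ** 0.5, 1.0, 5.0)
thetas = [THETA_CB + d for d in offsets]

print("sample mean dominates crystal ball:", dominates(mse_sample_mean, mse_crystal_ball, thetas, n))
print("crystal ball dominates sample mean:", dominates(mse_crystal_ball, mse_sample_mean, thetas, n))

# Away from THETA_CB, a larger n shrinks the MSE of the sample mean only.
for size in (10, 1000):
    print(size, mse_sample_mean(0.0, size), mse_crystal_ball(0.0, size))
```

Neither estimator dominates the other on this grid: the constant estimator has the smaller MSE within $\lambda/\sqrt{n}$ of 7405926 and a larger one elsewhere. The last loop shows that the MSE of the sample mean shrinks with $n$ while that of the constant estimator does not change, which is the sense in which consistency separates the two.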

In light of the fact that the optimal properties of an estimator concern its
generic capacity to zero in on $\theta^{*}$, the relevant frequentist errors need to be
associated with a particular inference procedure. The factual nature of the
underlying frequentist estimation reasoning renders any error probabilities
associated with the direct inference $\widehat{\theta}(\mathbf{x}_0)=\theta^{*}$ illegitimate, because post-data
$\widehat{\theta}(\mathbf{x}_0)$ is either equal to $\theta^{*}$ or not, and no non-degenerate probability can be
attached to either of those two alternatives.

5.3 James-Stein estimator: a frequentist perspective 

For a proper evaluation of the above James-Stein result, it is important to 
bring out the conflict between the overall MSE and the reasoning underlying 
frequentist estimation. When the James-Stein estimator is viewed from this 
frequentist perspective several issues arise. 

First, the James-Stein result is practically useless because $\hat{\theta}_{LS}(\mathbf{X})$
and $\hat{\theta}_{JS}(\mathbf{X})$ are inconsistent estimators of $\theta$ since there
is essentially one observation ($X_k$) for each unknown parameter and as
$m\to\infty$ the number of unknown parameters increases at the same rate. To bring
out the futility of comparing these two estimators more markedly, consider the
following simpler example.

Example. Let $\mathbf{X}:=(X_1,X_2,\ldots,X_n)$ be a sample from the simple Normal
model:

$$X_k\sim\text{NIID}(\theta,1),\quad k=1,2,\ldots,n,\ \text{for } n>2. \tag{14}$$

Comparing the two estimators $\hat{\theta}_1=X_n$, $\hat{\theta}_2=\frac{1}{2}(X_1+X_n)$
and inferring that $\hat{\theta}_2$ is relatively more efficient than $\hat{\theta}_1$
since:

$$\text{MSE}(\hat{\theta}_2(\mathbf{X});\theta)=\frac{1}{2}<\text{MSE}(\hat{\theta}_1(\mathbf{X});\theta)=1,\quad\text{for all }\theta\in\mathbb{R},$$

is totally uninteresting because both estimators are practically useless. This
is because in frequentist estimation the minimal property for estimators is not
admissibility but consistency, on the basis of which both of these estimators will
be excluded from consideration. Indeed, no frequentist would seriously propose
$\hat{\theta}_1$ or $\hat{\theta}_2$ as sensible estimators.
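
The role of consistency here can be illustrated with a short simulation. The Python sketch below is an illustration under assumed settings (the true value `theta_true`, the sample sizes and the number of replications are hypothetical choices, and the sample mean is added only as a benchmark). It estimates the MSE of the last observation $X_n$ and of $\frac{1}{2}(X_1+X_n)$ as $n$ grows.

```python
import random
import statistics

def simulate_mse(theta, n, reps=2000, seed=1):
    """Monte Carlo MSE of three estimators of theta under NIID(theta, 1)."""
    rng = random.Random(seed)
    sq_err = {"last_obs": 0.0, "avg_first_last": 0.0, "sample_mean": 0.0}
    for _ in range(reps):
        x = [rng.gauss(theta, 1.0) for _ in range(n)]
        sq_err["last_obs"] += (x[-1] - theta) ** 2                       # X_n
        sq_err["avg_first_last"] += (0.5 * (x[0] + x[-1]) - theta) ** 2  # (X_1 + X_n)/2
        sq_err["sample_mean"] += (statistics.fmean(x) - theta) ** 2      # benchmark
    return {name: total / reps for name, total in sq_err.items()}

theta_true = 2.0  # hypothetical value, chosen only for illustration
table = {n: simulate_mse(theta_true, n) for n in (4, 40, 160)}
for n, row in table.items():
    print(n, {name: round(value, 3) for name, value in row.items()})
```

The first two MSEs stay roughly where they are however large $n$ becomes, whereas the MSE of the sample mean shrinks with $n$. Ranking the first two against each other therefore says little about their capacity to zero in on the true value.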

In light of that, a way to render the above Stein paradox potentially inter- 
esting from the frequentist perspective is to use panel (longitudinal) data where 
the sample takes the form: 

$$\mathbf{X}_t:=(X_{1t},X_{2t},\ldots,X_{mt}),\quad t=1,2,\ldots,n.$$

In this case the Least-Squares and James-Stein estimators take the form: 

$$\hat{\theta}_{LS}(\mathbf{X})=(\overline{X}_1,\overline{X}_2,\ldots,\overline{X}_m),\ \text{where }\overline{X}_k=\frac{1}{n}\sum_{t=1}^{n}X_{kt},\ k=1,2,\ldots,m,$$

6 + JS (X)= (l - (2g^) + X, where X:= (X 1 ,X 2 , ...,X m ) . 

Second, the notion of "better" in the James-Stein result needs to be evaluated 
more critically. It is clear that the James-Stein loss function in (|S|) introduces 
a trade-off between the accuracy of the estimators of individual parameters
$(\theta_1,\theta_2,\ldots,\theta_m)$ and the overall accuracy, in the sense that the
increase in the latter is at the expense of the former. Hence, the James-Stein
result raises a key question: 'in what sense does the overall MSE among a group of
estimated means based on statistically independent processes provide a better
measure of 'error' in learning about the true means?' The short answer is that it
doesn't. Indeed, the overall MSE will not be the relevant statistical error when
the primary objective of estimation is to learn from data about $\theta^*$, the
true value of $\theta$; the one that generated the data in question. Having said
that, such an expected loss might be relevant for substantive purposes when the
underlying components of the vector stochastic process
$\{\mathbf{X}_t,\ t\in\mathbb{N}:=(1,2,\ldots)\}$ are related in a substantive sense
via some extraneous loss function. For learning purposes, however, the two should
be kept separate because they serve very different objectives.
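
The trade-off between overall and individual accuracy can be made concrete with a small simulation. The Python sketch below is only an illustration: it uses the basic James-Stein rule shrinking toward the origin (not the positive-part version), one unit-variance observation per mean, and a hypothetical vector of true means `theta_true` chosen so that one component lies far from the others.

```python
import random

def james_stein(x):
    """Basic James-Stein rule shrinking toward the origin (unit variance, m >= 3)."""
    m = len(x)
    norm_sq = sum(v * v for v in x)
    shrink = 1.0 - (m - 2) / norm_sq
    return [shrink * v for v in x]

def simulate(theta, reps=20000, seed=2):
    """Componentwise Monte Carlo MSEs of the LS estimate (x itself) and of JS."""
    rng = random.Random(seed)
    m = len(theta)
    ls_sq = [0.0] * m
    js_sq = [0.0] * m
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]
        js = james_stein(x)
        for k in range(m):
            ls_sq[k] += (x[k] - theta[k]) ** 2
            js_sq[k] += (js[k] - theta[k]) ** 2
    return [s / reps for s in ls_sq], [s / reps for s in js_sq]

theta_true = [4.0] + [0.0] * 9  # hypothetical true means, for illustration only
ls_mse, js_mse = simulate(theta_true)
print("overall MSE  LS: %.2f  JS: %.2f" % (sum(ls_mse), sum(js_mse)))
print("component 1  LS: %.2f  JS: %.2f" % (ls_mse[0], js_mse[0]))
```

Under these assumed values the overall MSE of the James-Stein rule is smaller, but the MSE for the first mean is larger than that of its least-squares estimate: the gain in the aggregate is paid for by the individual component.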

Third, the key concept underlying the James-Stein result, that of admissibility
with respect to a particular loss function, seems inappropriate for frequentist
inference in general and optimal estimation in particular. The conflict arises
because the primary aim and nature of reasoning underlying frequentist inference
in general is at odds with the quantifier 'for all $\theta\in\Theta$' underlying
these concepts. There is nothing in the notion of admissibility that promotes
learning from data about $\theta^*$, or calibrating the procedure's capacity to
achieve that aim. On the contrary, it treats all possible values of $\theta$ in
$\Theta$ on par.

Fourth, evaluations of the overall MSE that depend on extraneous information can
be both awkward and highly misleading in practice. To bring out the difficulties,
let us take an example from economics, a field where loss functions supposedly
arise naturally. Consider the simple linear regression model:

$$y_t=\beta_0+\beta_1x_{1t}+\beta_2x_{2t}+\beta_3x_{3t}+u_t,\quad u_t\sim\text{NIID}(0,\sigma^2),\quad t=1,2,\ldots,n,$$

where the unknown parameters of interest are $\theta:=(\beta,\sigma^2)$,
$\beta:=(\beta_0,\beta_1,\beta_2,\beta_3)$. A moment's reflection suggests that
serious practical difficulties are raised by the mathematical structure of a loss
function such as that of Stein:

$$L_2(\beta,\hat{\beta}(\mathbf{Z}))=\|\hat{\beta}(\mathbf{Z})-\beta\|^2=\sum_{k=0}^{3}(\hat{\beta}_k(\mathbf{Z})-\beta_k)^2, \tag{15}$$

where $\mathbf{Z}:=(\mathbf{y},\mathbf{X})$, and the James-Stein estimator takes the form:

3 JS (Z)= (l- ^-^-^ j 3, for c > 0, 3=(X T X)- 1 X T y.

The crucial source of the problem is that in a decision-theoretic context
$L_2(\beta,\hat{\beta}(\mathbf{Z}))$ is treated as a unitless numerical measure of
how costly the various consequences of potential decisions associated with
$\hat{\beta}(\mathbf{z}_0)$ are. However, it is well-known that the regression
coefficients are not unitless; $\beta_i$ depends crucially on the units of
measurement of both $y_t$ and $x_{it}$, $i=1,2,3$. Worse, in practice such
coefficients vary greatly in magnitude, say $\hat{\beta}_1(\mathbf{z}_0)=1.8$ and
$\hat{\beta}_3(\mathbf{z}_0)=-.004$, rendering the smaller coefficient estimates
more or less irrelevant for cost purposes because their relative contribution in
(15) will be minuscule. Moreover, one can change the cost associated with any
coefficient by changing the units of measurement of any of the variables involved,
which in economics is trivial to do. Such changes in the units of measurement will
change drastically the ranking of different potential decisions.

In summary, the above example raises serious practical questions about how the
loss function machinery can be implemented in practice to render the expected loss
associated with $\hat{\beta}(\mathbf{z}_0)$ for different values of $\beta$
meaningful. In particular, two practical questions arise:

(i) where does the extraneous information concerning costs associated with
parameter values come from? and

(ii) how does one select the functional form of the loss function to avoid the
serious unit of measurement problems raised above?

5.4 Confidence Interval Estimation

To bring out the frequentist reasoning underlying Confidence Interval (CI)
estimation, let us return to the simple Normal model in (14) and have a closer
look at the sampling distribution of a good estimator, $\overline{X}_n$,
[consistent, unbiased, fully efficient, sufficient] often stated as:

$$\overline{X}_n\sim N\left(\theta,\tfrac{1}{n}\right). \tag{16}$$

What is not usually explicitly revealed is that the evaluation of that distribution
is factual, i.e. under the True State of Nature (TSN), $\theta=\theta^*$, and
denoted by:

$$\overline{X}_n\overset{\theta=\theta^*}{\sim}N\left(\theta^*,\tfrac{1}{n}\right).$$
What is remarkable about this result is that when $\overline{X}_n$ is standardized
to define the pivotal function:

$$d(\mathbf{X};\theta):=\sqrt{n}(\overline{X}_n-\theta^*)\overset{\theta=\theta^*}{\sim}N(0,1), \tag{17}$$

one is certain that (17) holds only for the true $\theta^*$ and no other value. For
any other value of $\theta$, say $\theta_1\neq\theta^*$, the same evaluation will
yield:

$$d(\mathbf{X};\theta_1)=\sqrt{n}(\overline{X}_n-\theta_1)\overset{\theta=\theta^*}{\sim}N(\delta_1,1),\quad\delta_1=\sqrt{n}(\theta^*-\theta_1).$$

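The factual reasoning result in (17) provides the basis for constructing the
$(1-\alpha)$ Confidence Interval (CI):

$$\mathbb{P}\left(\overline{X}_n-c_{\frac{\alpha}{2}}\left(\tfrac{1}{\sqrt{n}}\right)<\theta<\overline{X}_n+c_{\frac{\alpha}{2}}\left(\tfrac{1}{\sqrt{n}}\right);\ \theta=\theta^*\right)=1-\alpha, \tag{18}$$

which asserts that the random interval
$\left[\overline{X}_n-c_{\frac{\alpha}{2}}\left(\tfrac{1}{\sqrt{n}}\right),\ \overline{X}_n+c_{\frac{\alpha}{2}}\left(\tfrac{1}{\sqrt{n}}\right)\right]$
will cover (overlay) the true mean $\theta^*$, whatever that happens to be, with
probability $(1-\alpha)$, or equivalently, the error of coverage is $\alpha$. Hence,
in frequentist estimation the coverage error probability depends only on the
sampling distribution of $\overline{X}_n$ and is attached to the random interval
for all values $\theta\neq\theta^*$ without requiring one to know $\theta^*$.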
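
That the coverage error probability belongs to the procedure, and not to any particular value of $\theta$, can be illustrated by simulation. The Python sketch below is a minimal illustration under assumed settings ($n=20$, $\alpha=0.05$, and three arbitrary values of $\theta^*$, none of which come from the text). The simulation has to pick a $\theta^*$ in order to generate data and to score coverage, but the interval itself is computed from the data alone.

```python
import random
from statistics import NormalDist, fmean

def coverage_rate(theta_star, n=20, alpha=0.05, reps=4000, seed=3):
    """Share of simulated samples whose (1 - alpha) interval covers theta_star."""
    rng = random.Random(seed)
    c_half = NormalDist().inv_cdf(1 - alpha / 2)  # c_{alpha/2}
    half_width = c_half / n ** 0.5                # sigma = 1, as in the model
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(theta_star, 1.0) for _ in range(n)]
        x_bar = fmean(x)
        # theta_star generates the data and scores coverage; the interval
        # endpoints x_bar -/+ half_width never use it.
        if x_bar - half_width <= theta_star <= x_bar + half_width:
            hits += 1
    return hits / reps

# Arbitrary (assumed) true values, to show the rate does not depend on them.
rates = {theta_star: coverage_rate(theta_star) for theta_star in (-3.0, 0.0, 50.0)}
print(rates)
```

Whatever value is taken to be the true one, the simulated coverage stays close to $1-\alpha$.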

The factual reasoning underlying estimation renders the post-data coverage error
probability degenerate since the TSN has played out and the observed CI
$\left[\overline{x}_n-c_{\frac{\alpha}{2}}\left(\tfrac{1}{\sqrt{n}}\right),\ \overline{x}_n+c_{\frac{\alpha}{2}}\left(\tfrac{1}{\sqrt{n}}\right)\right]$
either includes or excludes $\theta^*$, but there is no way to know. That is, there
is no non-degenerate post-data error probability one can attach to different values
of $\theta$ within the observed interval. The same factual reasoning undermines any
attempt to use $\hat{\theta}(\mathbf{x}_0)=\theta^*$ as a legitimate inference
result.

Any attempt by Bayesians (see Robert, 2007) to present various erroneous
interpretations of frequentist error probabilities by practitioners as evidence
that Bayesian reasoning is more intuitive is totally misplaced. A more convincing
explanation is that such misinterpretations have lingered in the statistical
literature since the pre-Fisher era, when the 'probable error' interval was given
an inverse-probability (Bayesian) interpretation; see Bowley (1937), Mills (1938).

5.5 Frequentist Hypothesis testing 

Another frequentist inference procedure one can employ to learn from data about
$\theta^*$ is hypothesis testing where the question posed is whether $\theta^*$ is
close enough to some prespecified value $\theta_0$.

In contrast to estimation, the reasoning underlying frequentist testing is
hypothetical in nature. For testing the hypotheses:

$$H_0:\theta=\theta_0\ \text{ vs. }\ H_1:\theta>\theta_0,\ \text{where }\theta_0\text{ is a prespecified value},$$

one returns to the same sampling distribution in (16), but transforms the pivotal
quantity in (17) into the test statistic by replacing $\theta^*$ with the
prespecified value $\theta_0$, yielding $d(\mathbf{X}):=\sqrt{n}(\overline{X}_n-\theta_0)$.
However, instead of evaluating it under the TSN, it is now evaluated under various
hypothetical scenarios associated with $H_0$ and $H_1$ to yield two types of
(hypothetical) sampling distributions:

(I) $d(\mathbf{X})=\sqrt{n}(\overline{X}_n-\theta_0)\overset{\theta=\theta_0}{\sim}N(0,1)$,

(II) $d(\mathbf{X})=\sqrt{n}(\overline{X}_n-\theta_0)\overset{\theta=\theta_1}{\sim}N(\delta_1,1)$, $\delta_1=\sqrt{n}(\theta_1-\theta_0)$ for $\theta_1>\theta_0$.

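In both cases (I)-(II) the underlying reasoning is hypothetical in the sense that
the TSN in (17) is replaced by hypothesized values of $\theta$, and the test
statistic provides a distance between the hypothesized values and $\theta^*$, the
true $\theta$, assumed to underlie the generation of the data $\mathbf{x}_0$,
yielding $d(\mathbf{x}_0)$. Using the sampling distribution in (I) one can define
the following legitimate error probabilities:

$$\begin{array}{ll}\text{significance level:} & \mathbb{P}(d(\mathbf{X})>c_\alpha;\ H_0)=\alpha,\\ \text{p-value:} & \mathbb{P}(d(\mathbf{X})>d(\mathbf{x}_0);\ H_0)=p(\mathbf{x}_0).\end{array} \tag{19}$$

Using the sampling distribution in (II) one can define:

$$\begin{array}{ll}\text{type II error prob.:} & \mathbb{P}(d(\mathbf{X})\leq c_\alpha;\ \theta=\theta_1)=\beta(\theta_1),\ \text{for }\theta_1>\theta_0,\\ \text{power:} & \mathbb{P}(d(\mathbf{X})>c_\alpha;\ \theta=\theta_1)=\varrho(\theta_1),\ \text{for }\theta_1>\theta_0.\end{array}$$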

It can be shown that the test $T_\alpha$, defined by the test statistic
$d(\mathbf{X})$ and the rejection region $C_1(\alpha)=\{\mathbf{x}: d(\mathbf{x})>c_\alpha\}$,
constitutes a Uniformly Most Powerful (UMP) test for significance level $\alpha$;
see Lehmann (1959). The type I [II] error probability is associated with test
$T_\alpha$ erroneously rejecting [accepting] $H_0$. The type I and II error
probabilities evaluate the generic capacity [whatever the sample realization
$\mathbf{x}\in\mathbb{R}_X^n$] of a test to reach correct inferences. Contrary to
Bayesian claims, these error probabilities have nothing to do with the temporal
or the physical dimension of the long-run metaphor associated with repeated
samples. The relevant feature of the long-run metaphor is the repeatability
(in principle) of the DGM represented by $\mathcal{M}_\theta(\mathbf{x})$, a feature
that can be easily operationalized using computer simulation; see Spanos (2012c).
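
As one possible way of operationalizing this repeatability, the Python sketch below simulates the one-sided test repeatedly under $H_0$ and under a single alternative, and compares the simulated rejection rates with the significance level and with the power implied by sampling distribution (II). It is an illustration only: the values $\theta_0=0$, $\theta_1=0.5$, $n=25$, $\alpha=0.05$ and the function names are assumptions, not taken from Spanos (2012c).

```python
import random
from statistics import NormalDist, fmean

def rejection_rate(theta_true, theta0=0.0, n=25, alpha=0.05, reps=4000, seed=4):
    """Share of simulated samples with d(x) = sqrt(n)(x_bar - theta0) > c_alpha."""
    rng = random.Random(seed)
    c_alpha = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(theta_true, 1.0) for _ in range(n)]
        d = n ** 0.5 * (fmean(x) - theta0)
        if d > c_alpha:
            rejections += 1
    return rejections / reps

def analytic_power(theta1, theta0=0.0, n=25, alpha=0.05):
    """P(d(X) > c_alpha; theta = theta1), using d(X) ~ N(delta1, 1) under theta1."""
    c_alpha = NormalDist().inv_cdf(1 - alpha)
    delta1 = n ** 0.5 * (theta1 - theta0)
    return 1 - NormalDist(mu=delta1, sigma=1.0).cdf(c_alpha)

type_one_rate = rejection_rate(theta_true=0.0)  # data generated under H0
power_sim = rejection_rate(theta_true=0.5)      # data generated under theta1 = 0.5
power_exact = analytic_power(theta1=0.5)
print(type_one_rate, power_sim, power_exact)
```

Both error probabilities are computed here without reference to any observed data or to the true $\theta^*$: they describe what the test procedure does under each hypothetical scenario.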

The key difference between the significance level $\alpha$ and the p-value is that
the former is a pre-data and the latter a post-data error probability. Indeed, the
p-value can be viewed as the smallest significance level $\alpha$ at which $H_0$
would have been rejected with data $\mathbf{x}_0$. The legitimacy of post-data
error probabilities underlying the hypothetical reasoning can be used to go beyond
the N-P accept/reject rules and provide an evidential interpretation pertaining to
the discrepancy from the null warranted by data $\mathbf{x}_0$. This is achieved
using the post-data severity evaluation reasoning which can be used to address the
fallacies of acceptance and rejection, as well as shed light on several confusions
in frequentist inference; see Mayo (1996), Mayo and Spanos (2006; 2011). In
relation to this it is important to note that the overwhelming majority of these
confusions have been introduced into frequentist inference by Bayesians by
deploying rigged examples; see Spanos (2010; 2011a-b; 2012a-d).
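
The reading of the p-value as the smallest significance level at which $H_0$ would have been rejected can be checked directly for the one-sided test of the simple Normal model. The Python sketch below is an illustration only; the data `x0` are invented for the purpose and the helper names are hypothetical.

```python
from statistics import NormalDist, fmean

def d_stat(x0, theta0):
    """d(x0) = sqrt(n) * (sample mean - theta0), with sigma = 1 as in the model."""
    return len(x0) ** 0.5 * (fmean(x0) - theta0)

def p_value(x0, theta0):
    """P(d(X) > d(x0); H0) for testing theta = theta0 against theta > theta0."""
    return 1.0 - NormalDist().cdf(d_stat(x0, theta0))

def rejects(x0, theta0, alpha):
    """N-P rule: reject H0 when d(x0) exceeds the threshold c_alpha."""
    return d_stat(x0, theta0) > NormalDist().inv_cdf(1.0 - alpha)

# Hypothetical data, invented purely for illustration.
x0 = [0.9, -0.2, 1.4, 0.3, 0.8, 1.1, -0.5, 0.6, 1.2]
p = p_value(x0, theta0=0.0)
for alpha in (0.10, 0.05, 0.01):
    print("alpha =", alpha, "reject H0:", rejects(x0, 0.0, alpha),
          "p-value =", round(p, 4))
```

For these data the test rejects at every significance level above the p-value and at none below it, which is the pre-data versus post-data relationship described above.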

Despite the fact that frequentist testing uses hypothetical reasoning, its main
objective is also to learn from data about the true model
$\mathcal{M}^*(\mathbf{x})=\{f(\mathbf{x};\theta^*)\}$, $\mathbf{x}\in\mathbb{R}_X^n$.
There is a modicum of truth in the usual textbook claim that a test statistic
provides a measure of disagreement (discordance) between the data $\mathbf{x}_0$
and the hypothesized $\theta_0$, even though the claim is somewhat misleading
because it compares apples and oranges; $\theta$ lives in $\Theta$ and $\mathbf{x}$
in $\mathbb{R}_X^n$, respectively. A more appropriate way to frame this claim,
however, is that a test statistic like $d(\mathbf{X}):=\sqrt{n}(\overline{X}_n-\theta_0)$
constitutes nothing more than a scaled distance between $\theta^*$ [the value
behind the generation of $\overline{X}_n$], and a hypothesized value $\theta_0$.
This stems from the fact that frequentist inference assumes that the data
$\mathbf{x}_0$ have been generated by $\mathcal{M}^*(\mathbf{x})$.

6 Is expected loss a legitimate frequentist error? 

The question that naturally arises at this stage is 'what do the above frequen- 
tist error probabilities, the type I and II, the p-value and the coverage error 
probability, have in common?' 

First, they all stem directly from the statistical model
$\mathcal{M}_\theta(\mathbf{x})$ since the underlying sampling distributions of
estimators, test statistics and predictors are derived exclusively from the
distribution of the sample $f(\mathbf{x};\theta)$ via (8). In this sense, the
relevant error probabilities are directly related to statistical information
pertaining to the data as summarized by the statistical model
$\mathcal{M}_\theta(\mathbf{x})$ itself. As such, they have nothing to do with ad
hoc loss [cost, utility] functions based on extraneous information 'other than the
data'.

Second, all these error probabilities are attached to a particular frequentist
inference procedure as they relate to a relevant inferential claim. These error
probabilities calibrate the effectiveness of inference procedures in learning from
data about the true statistical model
$\mathcal{M}^*(\mathbf{x})=\{f(\mathbf{x};\theta^*)\}$, $\mathbf{x}\in\mathbb{R}_X^n$.
It is important to emphasize that 'truth' in this context refers to statistical,
and not substantive, adequacy, i.e. $\mathcal{M}^*(\mathbf{x})$ could have
generated the data $\mathbf{x}_0$ in question in so far as $\mathbf{x}_0$
represents a 'truly typical realization' of the stochastic process
$\{X_t,\ t\in\mathbb{N}\}$ underlying $\mathcal{M}_\theta(\mathbf{x})$, with the
'typicality' being testable vis-a-vis the data $\mathbf{x}_0$.

In light of these features, the question is: 'how do the risk comparisons of the
decision-theoretic perspective relate to these frequentist error probabilities?' or
'in what sense could a risk function defined by (2) potentially represent relevant
frequentist errors?' According to some Bayesians (see Robert, 2007), the risk
function does represent a legitimate frequentist error because it is derived by
taking expectations with respect to $f(\mathbf{x};\theta)$,
$\mathbf{x}\in\mathbb{R}_X^n$. This argument is misleading for several reasons.

(a) The expected losses stemming from the risk function $R(\theta,\hat{\theta})$
are attached to particular values of $\theta$ in $\Theta$, including $\theta^*$.
This assignment is in direct conflict with all the above legitimate error
probabilities that are attached to the inference procedure itself, and never to the
particular values of $\theta$ in $\Theta$. The expected loss assigned to each value
of $\theta$ in $\Theta$ has nothing to do with learning from data about $\theta^*$.
Indeed, the risk function will penalize a procedure for pinpointing $\theta^*$!
Granted, the 'crystal ball' estimate $\hat{\theta}(\mathbf{x}_0)=\theta_{cb}$, for a
prespecified value $\theta_{cb}$ in $\Theta$, can be a legitimate decision-theoretic
rule as well as a legitimate Bayesian inference with its associated degree of
belief, but it is never a legitimate frequentist inference. In this sense expected
losses can be useful in other contexts such as 'acceptance sampling', where the
objective of the inference is driven by the risk function.

(b) The second difficulty with the above claim is that a quantity cannot be
rendered meaningful or relevant for frequentist inference just because it is
defined by taking expectations over all $\mathbf{x}\in\mathbb{R}_X^n$. Indeed, in a
decision-theoretic framework, the dependence of the loss function on
$\mathbf{x}\in\mathbb{R}_X^n$ is treated as a nuisance that is addressed by taking
expectations with respect to $f(\mathbf{x};\theta)$, so that the risk function
involves only $\theta\in\Theta$. This is very different from expectations with
respect to the sampling distribution of an estimator $\hat{\theta}_n(\mathbf{X})$,
i.e. $E(\hat{\theta}_n(\mathbf{X}))=\theta^*$, since the latter pertains to the true
value $\theta^*$. Indeed, comparing the expected loss to the above legitimate error
probabilities it becomes clear that any loss function-based evaluations that depend
on extraneous information can be both awkward and highly misleading in practice.

In light of the above discussion, it is not a coincidence that textbooks written
by Bayesian statisticians extol the virtues of the decision-theoretic perspective
and then proceed to present the Bayesian approach as its natural extension; see
Berger (1985), Bernardo and Smith (2000), Ghosh et al. (2006), Robert (2007),
Schervish (1995) inter alia. What makes the Bayesian case against frequentist
inference misplaced is its conflating of the universal (for all
$\theta\in\Theta$) with the existential (there exists a $\theta^*\in\Theta$ such
that) quantifiers, and then charging the frequentists with fallacious results
stemming from the very confusion permeating the Bayesian claims.

7 Summary and conclusions 

The above discussion called into question the claim that decision theory provides 
a unifying framework for comparing the frequentist and Bayesian approaches to 
inference by using a common terminology based on decision rules, action spaces, 
loss and risk functions, admissibility, etc. It is argued that a closer look reveals 
that the decision-theoretic perspective distorts frequentist inference for two main 
reasons. 

First, the quantifier 'for all $\theta\in\Theta$' is inappropriate for evaluating
frequentist inference procedures because their primary objective is to learn from
data about the true value $\theta^*$. What matters for a good frequentist procedure
is not its behavior for all possible values $\theta\in\Theta$, but how well it does
in shedding light on the true value $\theta^*\in\Theta$. This capacity to learn
from data is what legitimate frequentist error probabilities are calibrating. They
do that by assigning error probabilities to the inference procedures themselves,
and not to different values of $\theta$ in $\Theta$.

Second, in light of the inappropriateness of the universal quantifier, the risk
function $R(\theta,\hat{\theta})$ does not give rise to any relevant errors
pertaining to frequentist inference because its attribution of expected losses to
different values of $\theta$ in $\Theta$ has nothing to do with learning from data
about $\theta=\theta^*$. Instead, it is relevant for evaluating expected losses in
situations like acceptance sampling where the loss function is based on (cost,
utility) information other than the data. Fisher (1955) was correct in claiming
that the latter scenario is atypical of statistical modeling in a scientific
context, and the decision-theoretic perspective distorts frequentist inference
because the objectives of inference in the two cases are at odds with each other.

The combination of the inappropriateness of admissibility and the irrelevance
of extraneous (other than the data) loss information when the primary objective
is learning about the true state of nature ($\theta=\theta^*$), calls into
question:

(i) the appropriateness of the decision-theoretic set up for comparing the
frequentist and Bayesian approaches,

(ii) the relevance and appropriateness of the James-Stein risk 'optimality', and

(iii) the standpoint that a way to generate good statistical procedures is to
find the Bayes solution for a particular risk function using a reasonable prior
and then examine its frequentist properties to see whether it is satisfactory from
the latter viewpoint.